In [743]:
from IPython.display import display, HTML
display(HTML("<style>.jp-Notebook {width: 70% !important; margin: auto !important;} table {margin: 0 auto !important; margin-top: 20px !important;}</style>"))
What does the dataset contain?
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
In [ ]:
import pandas as pd
from IPython.display import display, Markdown
data = {
    "Feature Index": [
        "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"
    ],
    "Feature Name": [
        "ID number",
        "Diagnosis (M = malignant, B = benign)",
        "Radius (mean of distances from center to points on the perimeter)",
        "Texture (standard deviation of gray-scale values)",
        "Perimeter",
        "Area",
        "Smoothness (local variation in radius lengths)",
        "Compactness (perimeter^2 / area - 1.0)",
        "Concavity (severity of concave portions of the contour)",
        "Concave points (number of concave portions of the contour)",
        "Symmetry",
        "Fractal dimension ('coastline approximation' - 1)"
    ]
}
df_features = pd.DataFrame(data)
df_features.columns = ["Feature Index", "Feature Name"]
styled_table = (
    df_features.style
    .set_caption("Breast Cancer Wisconsin Dataset Features")
    .set_table_styles([
        {'selector': 'th', 'props': [('text-align', 'left')]},
        {'selector': 'td', 'props': [('text-align', 'left')]},
        {'selector': 'caption', 'props': [('caption-side', 'top'), ('font-weight', 'bold')]}
    ])
)
display(styled_table)
|   | Feature Index | Feature Name |
|---|---|---|
| 0 | 1 | ID number |
| 1 | 2 | Diagnosis (M = malignant, B = benign) |
| 2 | 3 | Radius (mean of distances from center to points on the perimeter) |
| 3 | 4 | Texture (standard deviation of gray-scale values) |
| 4 | 5 | Perimeter |
| 5 | 6 | Area |
| 6 | 7 | Smoothness (local variation in radius lengths) |
| 7 | 8 | Compactness (perimeter^2 / area - 1.0) |
| 8 | 9 | Concavity (severity of concave portions of the contour) |
| 9 | 10 | Concave points (number of concave portions of the contour) |
| 10 | 11 | Symmetry |
| 11 | 12 | Fractal dimension ('coastline approximation' - 1) |
Exploratory Data Analysis
In [715]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
In [716]:
df = pd.read_csv('data.csv')
In [717]:
df.describe()
Out[717]:
|   | id | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5.690000e+02 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | ... | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 0.0 |
| mean | 3.037183e+07 | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.096360 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | ... | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 | NaN |
| std | 1.250206e+08 | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.079720 | 0.038803 | 0.027414 | ... | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 | NaN |
| min | 8.670000e+03 | 6.981000 | 9.710000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | 0.000000 | 0.106000 | ... | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156500 | 0.055040 | NaN |
| 25% | 8.692180e+05 | 11.700000 | 16.170000 | 75.170000 | 420.300000 | 0.086370 | 0.064920 | 0.029560 | 0.020310 | 0.161900 | ... | 21.080000 | 84.110000 | 515.300000 | 0.116600 | 0.147200 | 0.114500 | 0.064930 | 0.250400 | 0.071460 | NaN |
| 50% | 9.060240e+05 | 13.370000 | 18.840000 | 86.240000 | 551.100000 | 0.095870 | 0.092630 | 0.061540 | 0.033500 | 0.179200 | ... | 25.410000 | 97.660000 | 686.500000 | 0.131300 | 0.211900 | 0.226700 | 0.099930 | 0.282200 | 0.080040 | NaN |
| 75% | 8.813129e+06 | 15.780000 | 21.800000 | 104.100000 | 782.700000 | 0.105300 | 0.130400 | 0.130700 | 0.074000 | 0.195700 | ... | 29.720000 | 125.400000 | 1084.000000 | 0.146000 | 0.339100 | 0.382900 | 0.161400 | 0.317900 | 0.092080 | NaN |
| max | 9.113205e+08 | 28.110000 | 39.280000 | 188.500000 | 2501.000000 | 0.163400 | 0.345400 | 0.426800 | 0.201200 | 0.304000 | ... | 49.540000 | 251.200000 | 4254.000000 | 0.222600 | 1.058000 | 1.252000 | 0.291000 | 0.663800 | 0.207500 | NaN |
8 rows × 32 columns
All the relevant features are floats. There is also an unnecessary, entirely empty column called "Unnamed: 32" (its count is 0 in the table above), which seems to appear due to this issue with read_csv(): https://www.kaggle.com/discussions/general/354943
In [718]:
df.drop(['id', 'Unnamed: 32'], axis=1, inplace=True)
df.head()
Out[718]:
|   | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
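5 rows × 31 columns
We drop the "id" and "Unnamed: 32" columns since they are irrelevant.
In [719]:
def diagnosis_value(diagnosis):
    # Map the diagnosis label to a number: 'M' (malignant) -> 1, anything else -> 0
    return 1 if diagnosis == 'M' else 0
Encode the target variable numerically: 1 for malignant (M), 0 for benign (B).
In [720]: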
df['diagnosis'] = df['diagnosis'].apply(diagnosis_value)
df.head()
Out[720]:
|   | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 1 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 1 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 1 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
5 rows × 31 columns
The numerical encoding of the target variable has now been applied to the dataset.
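As a minimal sketch, the same encoding could also be written with Series.map, which makes the label-to-number mapping explicit and turns any unexpected label into NaN instead of silently mapping it to 0. The toy labels below are made up for illustration and are not taken from data.csv.

```python
import pandas as pd

# Toy labels for illustration only (not taken from data.csv)
labels = pd.Series(["M", "B", "B", "M", "?"])

# Explicit mapping: M -> 1 (malignant), B -> 0 (benign)
encoded = labels.map({"M": 1, "B": 0})

# Labels outside the mapping become NaN, so they can be detected
print(encoded.tolist())
print("unexpected labels:", int(encoded.isna().sum()))
```

With apply and the diagnosis_value function, the "?" label would have been encoded as 0 (benign), which is worth keeping in mind for less clean data.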
In [721]:
df.isnull().values.any()
Out[721]:
np.False_
There are no null values.
In [722]:
# Plot histograms for each feature
df.hist(bins=15, figsize=(20, 15), layout=(6, 6))
plt.tight_layout()
plt.show()
The dataset is slightly imbalanced.
Most features seem to follow a roughly normal distribution, some of them with a positive skew.
For the strongly skewed ones it is feasible to apply a log-transform.
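The class balance can also be quantified rather than read off the histogram. The sketch below is only an illustration: it builds a stand-in target from the test-split supports reported in the classification reports further down (71 benign, 43 malignant); with the real data one would call the same methods on df['diagnosis'].

```python
import pandas as pd

# Stand-in target built from the test-split supports reported later in this
# notebook (71 benign, 43 malignant); with the real data use df['diagnosis'].
y = pd.Series([0] * 71 + [1] * 43, name="diagnosis")

counts = y.value_counts()
shares = y.value_counts(normalize=True)
print(counts.to_dict())
print(shares.round(3).to_dict())

# Ratio of majority to minority class; values near 1 mean balanced classes
imbalance_ratio = counts.max() / counts.min()
print(round(imbalance_ratio, 2))
```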
In [723]:
skewness_arr = df.skew().sort_values(ascending=False)
skewness_arr = skewness_arr[skewness_arr > 2]
print(skewness_arr)
skewness_arr = skewness_arr.index.tolist()
# Add 1 before taking the log: log(0) is undefined and log(x) is very negative for x close to 0
df[skewness_arr] = df[skewness_arr].apply(lambda x: np.log(x + 1))
df[df.columns.difference(['diagnosis'])] = preprocessing.StandardScaler().fit_transform(df[df.columns.difference(['diagnosis'])])
area_se                 5.447186
concavity_se            5.110463
fractal_dimension_se    3.923969
perimeter_se            3.443615
radius_se               3.088612
smoothness_se           2.314450
symmetry_se             2.195133
dtype: float64
Select the features with a large positive skewness (above 2) and apply a log-transform to them.
After that, standardize every feature (except diagnosis) with StandardScaler().
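To illustrate the effect of these two steps in isolation, here is a small sketch on synthetic data. The log-normal column is a made-up stand-in for a skewed feature such as area_se, and np.log1p is used as an equivalent of np.log(x + 1).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic right-skewed feature (log-normal); a stand-in for e.g. area_se
toy = pd.DataFrame({"skewed": rng.lognormal(mean=0.0, sigma=1.0, size=2000)})

skew_before = toy["skewed"].skew()
# np.log1p(x) computes log(1 + x), matching np.log(x + 1) used above
toy["skewed"] = np.log1p(toy["skewed"])
skew_after = toy["skewed"].skew()

# Standardize to zero mean and unit variance
scaled = StandardScaler().fit_transform(toy[["skewed"]])
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
print(f"mean: {scaled.mean():.3f}, std: {scaled.std():.3f}")
```

Standardization only shifts and rescales a feature, so it does not change the skewness; that is why the log-transform is applied first.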
In [724]:
df.hist(bins=15, figsize=(20, 15), layout=(6, 6))
plt.tight_layout()
plt.show()
In [725]:
# Plot the correlation matrix
plt.figure(figsize=(12, 10))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5).figure.set_size_inches(20, 10)
plt.title('Correlation Matrix')
plt.show()
The diagnosis correlates well with the "size" (perimeter, radius, area) and the concavity of the nucleus.
Generally, the features are highly correlated with each other, although many of these correlations are redundant: they link features that measure roughly the same thing (the mean, worst and se variants of each measurement).
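With 31 columns the heatmap is hard to read, so it can help to list the strongly correlated pairs explicitly. The sketch below shows one way to do that on a made-up toy frame (the column names and the 0.9 threshold are assumptions for illustration); with the real data, toy.corr() would be replaced by df.corr().

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
radius = rng.normal(10.0, 2.0, size=500)
# Toy frame: perimeter and area are derived from radius, texture is independent
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(0.0, 0.5, size=500),
    "area": np.pi * radius ** 2,
    "texture": rng.normal(20.0, 4.0, size=500),
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so every pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
high = pairs[pairs > 0.9]
print(high)
```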
In [726]:
# Plot pairplot for a subset of features
sns.pairplot(
df[["diagnosis", "radius_mean", "texture_mean", "perimeter_mean", "area_mean",
"smoothness_mean", "compactness_mean", "concavity_mean", "concave points_mean", "symmetry_mean",
"fractal_dimension_mean" ]],
hue = "diagnosis",
palette={1: 'orange', 0: 'blue'}
)
plt.show()
Overall, the classes are well separated. The only features that do not separate them well are fractal_dimension, symmetry, smoothness and texture.
We can also observe that the larger the nucleus is (measured by the radius, perimeter and area), the higher the probability of malignant breast cancer.
Concavity shows a similar pattern, but there are a few outliers.
It is clear from the EDA that the dataset is of good quality: there is no missing data, the classes are well separated, the features are well distributed and the classes are only slightly imbalanced.
The only thing that might be concerning is the large number of features, which gives the dataset a high dimensionality. However, we will address this issue later with dimension reduction.
Training
In general, we will use cross-validation and a grid/random search for hyperparameter optimization.
Several models are featured, including simple ones (e.g. KNN) and more robust ones (e.g. Neural Networks, Random Forest).
The following models are tested: KNN, Neural Networks, Logistic Regression and Random Forest.
F1-score is used as the metric since both false positives and false negatives are important.
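The general recipe (stratified cross-validation, grid search, F1 scoring, final check on a held-out split) can be sketched on synthetic data as follows. This is not the notebook's own training code: the data comes from make_classification, the grid is deliberately tiny, and the Pipeline is an addition of this sketch so that the scaler is fitted on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data standing in for the real feature matrix
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           weights=[0.63, 0.37], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scaling inside the pipeline is refit on each training fold only
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 7]},
                      cv=cv, scoring="f1")
search.fit(X_train, y_train)

test_f1 = f1_score(y_test, search.predict(X_test))
print(search.best_params_)
print(round(test_f1, 3))
```

The stratify=y argument keeps the class proportions equal in the train and test splits, which matters more as the imbalance grows.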
In [727]:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import pprint
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
class ModelTrain:
    def __init__(self, df):
        X = df.drop('diagnosis', axis=1)
        y = df['diagnosis']
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        self.X_train = X_train
        self.X_test = X_test
        self.y_train = y_train
        self.y_test = y_test
        self.X = X
        self.y = y
        self.cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    def __train(self, model):
        model.fit(self.X_train, self.y_train)
        best_model = model.best_estimator_
        return best_model, best_model.predict(self.X_test)

    def __grid_search(self, model, param_grid):
        return GridSearchCV(
            estimator=model,
            param_grid=param_grid,
            cv=self.cv,
            scoring='f1',
            n_jobs=-1
        )

    def __res(self, best_model, y_pred):
        return (best_model,
                classification_report(self.y_test, y_pred, output_dict=True, target_names=['B', 'M']),
                classification_report(self.y_test, y_pred, target_names=['B', 'M']))

    def KNN(self):
        knn = KNeighborsClassifier()
        param_grid = {
            'n_neighbors': [3, 5, 7, 9, 11],
            'weights': ['uniform', 'distance'],
            'metric': ['euclidean', 'manhattan', 'minkowski']
        }
        grid_search_model = self.__grid_search(knn, param_grid)
        best_model, y_pred = self.__train(grid_search_model)
        return self.__res(best_model, y_pred)

    def NeuralNetwork(self):
        mlp = MLPClassifier(random_state=1, max_iter=10000)
        param_grid = {
            'hidden_layer_sizes': [(8,), (16,), (32,), (16, 16)],
            'activation': ['tanh', 'relu'],
            'solver': ['adam', 'sgd'],
            'learning_rate_init': [0.001, 0.01, 0.1],
            'alpha': [0.0001, 0.001, 0.01],
        }
        grid_search_model = self.__grid_search(mlp, param_grid)
        best_model, y_pred = self.__train(grid_search_model)
        return self.__res(best_model, y_pred)

    def LogisticRegression(self):
        logreg = LogisticRegression()
        logreg.fit(self.X_train, self.y_train)
        y_pred = logreg.predict(self.X_test)
        return self.__res(logreg, y_pred)

    def RandomForest(self):
        rf = RandomForestClassifier(random_state=42)
        param_grid = {
            'n_estimators': [100, 200, 300],
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4],
            'bootstrap': [True, False],
        }
        grid_search_model = self.__grid_search(rf, param_grid)
        best_model, y_pred = self.__train(grid_search_model)
        return self.__res(best_model, y_pred)

mt = ModelTrain(df)
KNN
In [728]:
best_model_knn, _, report_display = mt.KNN()
print(report_display)
              precision    recall  f1-score   support

           B       0.96      0.97      0.97        71
           M       0.95      0.93      0.94        43

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114
Neural Network
In [729]:
best_model_nn, _, report_display = mt.NeuralNetwork()
print(report_display)
              precision    recall  f1-score   support

           B       0.97      0.99      0.98        71
           M       0.98      0.95      0.96        43

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
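As a quick sanity check of how the columns of such a report relate, the F1-score can be recomputed by hand as the harmonic mean of precision and recall. The counts below are inferred, not read from the model: they are the values for class M that are consistent with the Neural Network report above (support 43, recall 0.95, precision 0.98).

```python
# Counts for class M that are consistent with the Neural Network report above
# (support 43, recall 0.95, precision 0.98); inferred, not read from the model.
tp, fn, fp = 41, 2, 1

precision = tp / (tp + fp)   # share of predicted M that are truly M
recall = tp / (tp + fn)      # share of true M that were found
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # prints 0.98 0.95 0.96
```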
Logistic Regression
In [730]:
best_model_logres, _, report_display = mt.LogisticRegression()
print(report_display)
              precision    recall  f1-score   support

           B       0.99      0.99      0.99        71
           M       0.98      0.98      0.98        43

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114
Random Forest
In [731]:
best_model_ranfor, _, report_display = mt.RandomForest()
print(report_display)
              precision    recall  f1-score   support

           B       0.96      0.97      0.97        71
           M       0.95      0.93      0.94        43

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114
After training and testing several models we can conclude that all of them perform well on the dataset, achieving a precision of at least 95% for both classes.
However, Logistic Regression stands out from the tested models, achieving the highest scores (98% accuracy and a per-class F1-score of 98-99%). Surprisingly, such a simple model beats the much more complex Neural Network classifier, while KNN and Random Forest produce identical test reports.
Further Questions
Now, we look at a few further questions worth investigating.
Which features are the most important?
In [732]:
coefficients = pd.DataFrame(best_model_logres.coef_.flatten(), mt.X.columns, columns=['Coefficient'])
coefficients = coefficients.sort_values(by='Coefficient', ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(x='Coefficient', y=coefficients.index, data=coefficients, hue=coefficients.index, palette='viridis', legend=False)
plt.title('Feature Importances from Logistic Regression')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
By extracting the coefficients of the Logistic Regression model we can see which features were the most important (i.e. had the largest coefficients) in the linear combination. Since every feature was standardized beforehand, the coefficients are comparable across features.
Among the most relevant features are texture, radius, area and perimeter.
In [733]:
cols = list(df.columns)
cols.remove('diagnosis')
# for pair in sorted(zip(cols, best_model_ranfor.feature_importances_), key=lambda x: x[1], reverse=True):
# # print(f"{pair[0]}: {round(pair[1], 3)}")
# pass
feature_importances = best_model_ranfor.feature_importances_
feature_df = pd.DataFrame({
'Feature': cols,
'Importance': feature_importances
})
feature_df = feature_df.sort_values(by='Importance', ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_df, hue='Feature', palette='viridis', legend=False)
plt.title('Feature Importances from Random Forest')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
We can use the Random Forest model's .feature_importances_ attribute to see which features contribute the most to predicting malignant breast cancer.
The top features are concavity, area, radius and perimeter. This aligns well with the biological interpretation: a large radius and an irregular texture often indicate malignancy, because malignant tumors tend to grow and invade surrounding tissue.
We can observe that both Logistic Regression and Random Forest consider roughly the same features as important. However, the coefficients in Logistic Regression can be negative (since they are just real numbers in the linear combination) while the feature importance in Random Forest is always nonnegative (since it does not measure a linear relationship but a contribution of a feature to reducing impurity).
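The difference between signed coefficients and nonnegative importances can be seen on a small synthetic example. In this sketch (made-up data, not the notebook's models) one feature raises the odds of the positive class, one lowers them and one is pure noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 3))
# Feature 0 raises the odds of class 1, feature 1 lowers them, feature 2 is noise
logits = 2.0 * X[:, 0] - 2.0 * X[:, 1]
y = (logits + rng.normal(scale=0.5, size=n) > 0).astype(int)

logreg = LogisticRegression().fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Coefficients carry a sign; impurity-based importances are >= 0 and sum to 1
print("coefficients:", logreg.coef_.ravel().round(2))
print("importances: ", forest.feature_importances_.round(2))
```

Sorting features by raw coefficient therefore puts strong negative predictors at the bottom; ranking by absolute value is the closer analogue of the Random Forest ranking.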
Messing with the dataset (aka which model is the best with a low-quality dataset?)
As we concluded earlier, the dataset has several properties that make it easy to train on and work with. Thus, a natural question arises: what if the data is imprecise and of poor quality?
So we deliberately made the initial dataset imperfect and tested a few models on the new data.
In [734]:
noisy_df = df.copy()
noise_level = 0.05
numeric_columns = noisy_df.loc[:, noisy_df.columns != 'diagnosis']
for col in numeric_columns:
    col_range = noisy_df[col].max() - noisy_df[col].min()
    noise = np.random.uniform(-noise_level * col_range, noise_level * col_range, size=noisy_df.shape[0])
    noisy_df[col] += noise
First, we added uniformly distributed random noise to each feature, with an amplitude of up to 5% of the feature's range.
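The same perturbation can be written without an explicit loop and with a seeded generator, so that the noisy dataset is repeatable between runs. This is a sketch on a made-up two-column frame; the seed and the column names are assumptions, not part of the original experiment.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded so the perturbation is repeatable
toy = pd.DataFrame({"a": np.linspace(0.0, 10.0, 50),
                    "b": np.linspace(-1.0, 1.0, 50)})

noise_level = 0.05
col_range = toy.max() - toy.min()            # per-column range
bound = noise_level * col_range.to_numpy()   # per-column noise bound
# The per-column bounds broadcast across the rows of the (50, 2) sample
noise = rng.uniform(-bound, bound, size=toy.shape)
noisy = toy + noise
print((noisy - toy).abs().max())
```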
In [735]:
columns_to_keep = [col for col in noisy_df.columns if col.endswith('_mean')]
X_noisy = noisy_df[columns_to_keep]
X_noisy.head()
Out[735]:
|   | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.924103 | -1.757231 | 1.011019 | 1.260656 | 1.593912 | 3.291032 | 2.719108 | 2.652186 | 2.458015 | 2.340878 |
| 1 | 2.000094 | -0.335667 | 1.468144 | 1.975012 | -0.445580 | -0.404057 | -0.287819 | 0.735925 | 0.313804 | -1.108398 |
| 2 | 1.559854 | 0.478012 | 1.695131 | 1.704445 | 0.646808 | 1.118353 | 1.291260 | 2.141907 | 0.644783 | -0.456402 |
| 3 | -0.858307 | 0.296378 | -0.606947 | -0.584121 | 3.270515 | 3.616625 | 2.147613 | 1.220999 | 2.864877 | 4.855046 |
| 4 | 1.916440 | -1.204862 | 1.785224 | 1.535342 | 0.337097 | 0.664072 | 1.503881 | 1.287655 | -0.124872 | -0.287465 |
Then all the features were removed except those of "mean" type. Our hypothesis was that the mean captures information about the feature well enough so "se" and "worst" are unnecessary.
In [736]:
correlation_matrix_selected = df[columns_to_keep].corr()
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix_selected, annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix for Mean Features')
plt.show()
Plotting the correlation matrix again, we see that the "mean" features are still highly correlated with each other.
In [737]:
df_with_missing = noisy_df.copy()
missing_percentage = 0.3  # 30% missing values
columns_to_modify = df_with_missing.loc[:, df_with_missing.columns != 'diagnosis']
for col in columns_to_modify:
    num_missing = int(len(df_with_missing[col]) * missing_percentage)
    missing_indices = np.random.choice(df_with_missing.index, size=num_missing, replace=False)
    df_with_missing.loc[missing_indices, col] = np.nan
df_with_missing.head()
Out[737]:
|   | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.924103 | NaN | 1.011019 | NaN | 1.593912 | 3.291032 | 2.719108 | NaN | NaN | ... | 2.082082 | NaN | 2.007561 | 1.722426 | 1.635985 | 2.732424 | 1.953923 | 2.424034 | 2.748584 | 2.120955 |
| 1 | 1 | 2.000094 | -0.335667 | 1.468144 | NaN | -0.445580 | -0.404057 | -0.287819 | NaN | 0.313804 | ... | 1.947317 | -0.520197 | 1.622273 | 1.534513 | NaN | NaN | -0.057519 | NaN | -0.606162 | 0.516005 |
| 2 | 1 | 1.559854 | 0.478012 | NaN | 1.704445 | NaN | 1.118353 | 1.291260 | 2.141907 | 0.644783 | ... | 1.775112 | NaN | 1.621481 | 1.660115 | NaN | NaN | 0.902021 | 2.038613 | 0.883453 | NaN |
| 3 | 1 | -0.858307 | 0.296378 | -0.606947 | -0.584121 | 3.270515 | NaN | 2.147613 | NaN | NaN | ... | -0.217726 | -0.124007 | -0.142061 | -0.702994 | 3.375338 | 3.976404 | 1.822231 | 2.215575 | 5.789378 | 4.690161 |
| 4 | 1 | 1.916440 | -1.204862 | 1.785224 | 1.535342 | NaN | 0.664072 | NaN | 1.287655 | -0.124872 | ... | 1.433691 | -1.239162 | 1.309944 | 0.997753 | 0.503924 | NaN | NaN | 0.826369 | -0.931863 | -0.539437 |
5 rows × 31 columns
Next, we replace 30% of the existing values in every feature column with NaN so that the dataset more closely resembles a real-life one.
In [738]:
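imp = IterativeImputer(max_iter=10, random_state=0)
# Fill the NaN entries in place with model-based estimates
df_with_missing[:] = imp.fit_transform(df_with_missing)
df_with_missing = pd.DataFrame(df_with_missing, columns=df.columns)
# Randomly pick 30% of the rows to drop
rows_to_drop = df.sample(frac=0.3, random_state=42).index
# Note: the rows are dropped from df, so df_with_missing now holds the original
# (noise-free) values of the remaining rows, which is what the output below shows
df_with_missing = df.drop(rows_to_drop)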
df_with_missing.head()
Out[738]:
|   | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1.829821 | -0.353632 | 1.685955 | 1.908708 | -0.826962 | -0.487072 | -0.023846 | 0.548144 | 0.001392 | ... | 1.805927 | -0.369203 | 1.535126 | 1.890489 | -0.375612 | -0.430444 | -0.146749 | 1.087084 | -0.243890 | 0.281190 |
| 3 | 1 | -0.768909 | 0.253732 | -0.592687 | -0.764464 | 3.283553 | 3.402909 | 1.915897 | 1.451707 | 2.867383 | ... | -0.281464 | 0.133984 | -0.249939 | -0.550021 | 3.394275 | 3.893397 | 1.989588 | 2.175786 | 6.046041 | 4.935010 |
| 4 | 1 | 1.750297 | -1.151816 | 1.776573 | 1.826229 | 0.280372 | 0.539340 | 1.371011 | 1.428493 | -0.009560 | ... | 1.298575 | -1.466770 | 1.338539 | 1.220724 | 0.220556 | -0.313395 | 0.613179 | 0.729259 | -0.868353 | -0.397100 |
| 5 | 1 | -0.476375 | -0.835335 | -0.387148 | -0.505650 | 2.237421 | 1.244335 | 0.866302 | 0.824656 | 1.005402 | ... | -0.165498 | -0.313836 | -0.115009 | -0.244320 | 2.048513 | 1.721616 | 1.263243 | 0.905888 | 1.754069 | 2.241802 |
| 7 | 1 | -0.118517 | 0.358450 | -0.072867 | -0.218965 | 1.604049 | 1.140102 | 0.061026 | 0.281950 | 1.403355 | ... | 0.163763 | 0.401048 | 0.099449 | 0.028859 | 1.447961 | 0.724786 | -0.021054 | 0.624196 | 0.477640 | 1.726435 |
5 rows × 31 columns
Next, we use IterativeImputer to fill in the missing values. We also randomly drop 30% of the rows.
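To get a feeling for how well IterativeImputer can recover values when the features are strongly correlated (as they are in this dataset), here is a self-contained sketch on made-up data. The three toy columns and the 30% masking of a single column are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
x = rng.normal(size=300)
# Three strongly correlated toy columns
toy = pd.DataFrame({"x": x,
                    "y": 2.0 * x + rng.normal(scale=0.1, size=300),
                    "z": -x + rng.normal(scale=0.1, size=300)})

# Hide 30% of the entries of column y
masked = toy.copy()
missing_rows = rng.choice(toy.index.to_numpy(), size=int(0.3 * len(toy)), replace=False)
masked.loc[missing_rows, "y"] = np.nan

imp = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imp.fit_transform(masked),
                       columns=masked.columns, index=masked.index)

# Mean absolute error of the imputed entries against the hidden true values
mae = (imputed.loc[missing_rows, "y"] - toy.loc[missing_rows, "y"]).abs().mean()
print(f"remaining NaNs: {int(imputed.isna().sum().sum())}, MAE: {mae:.3f}")
```

Because the hidden column is almost a linear function of the other two, the imputed values land close to the true ones; with weakly correlated features the error would be considerably larger.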
In [739]:
def i_to_name(i):
    # Translate the position in the models list into a readable name
    if i == 0:
        return "Neural Network"
    elif i == 1:
        return "Logistic Regression"
    return "Random Forest"

results = []
for _ in range(5):
    # Pick 2 to 4 random feature columns and always keep the target
    num_columns_to_select = np.random.randint(2, 5)
    random_columns = np.random.choice(df_with_missing.loc[:, df_with_missing.columns != 'diagnosis'].columns, size=num_columns_to_select, replace=False)
    random_columns = np.append(random_columns, 'diagnosis')
    # Note: the rows and values are taken from df here, not from df_with_missing
    df_sel = df[random_columns]
    mt_noisy = ModelTrain(df_sel)
    models = [mt_noisy.NeuralNetwork, mt_noisy.LogisticRegression, mt_noisy.RandomForest]
    print("Training with the following features:", df_sel.columns)
    for i in range(len(models)):
        best_model, report, report_display = models[i]()
        # Average the per-class F1-scores (equivalent to the macro average)
        f1_benign = report['B']['f1-score']
        f1_malignant = report['M']['f1-score']
        f1_avg = (f1_benign + f1_malignant) / 2
        results.append({
            'features': ', '.join(df_sel.columns),
            'model': i_to_name(i),
            'f1_score': f1_avg
        })
df_results = pd.DataFrame(results)
plt.figure(figsize=(10, 6))
sns.barplot(data=df_results, x='features', y='f1_score', hue='model')
plt.xticks(rotation=45, ha='right')
plt.xlabel('Feature Subset')
plt.ylabel('F1 Score')
plt.title('Model Performance by Feature Subset')
plt.tight_layout()
plt.show()
Training with the following features: Index(['texture_se', 'area_se', 'compactness_mean', 'diagnosis'], dtype='object')
Training with the following features: Index(['area_worst', 'compactness_se', 'symmetry_se', 'perimeter_mean',
'diagnosis'],
dtype='object')
Training with the following features: Index(['texture_worst', 'perimeter_mean', 'diagnosis'], dtype='object')
Training with the following features: Index(['concave points_mean', 'radius_se', 'compactness_worst',
'texture_worst', 'diagnosis'],
dtype='object')
Training with the following features: Index(['texture_mean', 'area_se', 'fractal_dimension_mean', 'diagnosis'], dtype='object')
Conclusion
At the end we make the following remarks:
- The dataset is outstandingly good: even simple models such as KNN achieve a 95%+ result, while more complex models are close to 100%.
- Only a handful of features are really relevant to the classification: radius, area, perimeter and concavity. This also aligns well with the biological interpretation: the larger the nucleus is, the more probable it is that the patient has breast cancer (since malignant cells tend to invade their surroundings).
- Even after deliberately deteriorating the quality of the dataset, the tested models still achieved an F1-score of 90%+.